<html><head><style>body {
   color: black;
}
</style></head><body><h1 id="data-biography-us-state-public-health-agency-pandemic-tweets-with-cci-labels-2020-">Data Biography: US State Public Health Agency Pandemic Tweets with CCI Labels (2020)</h1>
<p><strong>Author:</strong> Samuel R. Mendez (they/them/theirs), PhD candidate at Harvard T.H. Chan School of Public Health. <strong>Date:</strong> September 20, 2024. <strong>Credit:</strong> This document is based on the <a href="https://weallcount.com/2019/01/21/an-introduction-to-the-data-biography/">We All Count data biography template</a>.</p>
<p>This document provides contextual information about the creation of the dataset &quot;cci_state_agencies_2020.csv&quot;, which comprises text from Tweets by US state public health agencies about the COVID-19 pandemic, published between January 1 and December 11, 2020. This document is in a Q&amp;A format, with each question written as a Level 3 heading. For more details on specific variables, see the associated data dictionary.</p>
<h2 id="who">Who</h2>
<p>This section provides details on the individuals and organizations involved in creating, collecting, labeling, and accessing the data.</p>
<h3 id="who-collected-the-data-">Who collected the data?</h3>
<p>Samuel R. Mendez (they/them/theirs), PhD candidate at Harvard T.H. Chan School of Public Health, collected the data. This included writing R scripts, identifying US state public health agency accounts, and preparing data for labeling, analysis, and distribution. Samuel accessed the data via the Twitter API for Academic Research (v2). Samuel created this version of the dataset for distribution in line with Twitter&#39;s Terms of Service, including only custom variables and the publicly funded, public-record text content.</p>
<h3 id="who-provided-the-data-">Who provided the data?</h3>
<p>Text data come from Tweets by US state public health agency Twitter accounts and COVID-19 pandemic response Twitter accounts affiliated with those agencies. At the time of data collection, Samuel identified public health agency Twitter accounts for all 50 states. Samuel found that every state except Wyoming had published at least 1 Tweet during the study period.</p>
<p>Though organizational authorship is clear, individual authors/contributors are unknown. Due to the prolonged, complex nature of the COVID-19 public health emergency in the US, it is possible that these Tweets come from a different mix of individual contributors than Tweets on other topics or from preceding time periods by the same organizational authors.</p>
<h3 id="who-labeled-the-data-">Who labeled the data?</h3>
<p>Data labels come from a pair of raters trained in the use of the <a href="https://www.cdc.gov/ccindex/index.html">CDC Clear Communication Index</a>. Samuel trained these two raters, a PhD student and an MPH student studying communication at the Harvard T.H. Chan School of Public Health, in applying the CDC Clear Communication Index to social media posts.</p>
<h3 id="who-has-access-to-data-">Who has access to data?</h3>
<p>The text, produced by publicly funded US state health agencies, is available in this Harvard Dataverse repository. Associated metadata is available only to those with Twitter API access who replicate the methods found in an <a href="https://osf.io/mgefr/">OSF repository</a> for a project associated with this data.</p>
<h2 id="what">What</h2>
<p>This section provides an overview of the data collected, including the type of data, any gaps in the collection, and specific considerations for understanding its scope and content.</p>
<h3 id="what-types-of-data-were-collected-">What types of data were collected?</h3>
<p>Using Twitter API access, we collected the full text of Tweets from US state health agencies and pandemic response Twitter accounts they managed, from January 1 through December 10, 2020, in English. We included only Tweets containing a match for regular expressions designed to capture variations of pandemic keywords: corona, COVID, distancing, face cover, face mask, #FlattenTheCurve, lockdown, #MaskUp, nCoV, pandemic, Paycheck Protection, quarantine, SARS-CoV-2, #StopTheSpread, Warp, and Wuhan. We used Tweet metadata available through the API to filter based on language, date, and author.</p>
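<p>As an illustration of this kind of keyword filter, here is a minimal Python sketch. It is not the R code used for the dataset. The source lists the keywords but not the regular expressions themselves, so the patterns below (case-insensitive matching, optional whitespace, a truncated "quarantin" stem, word boundaries around "warp") and the function name mentions_pandemic are assumptions.</p>

```python
import re

# Hypothetical patterns: the keyword list comes from the text above, but the
# exact variations each pattern allows are illustrative guesses.
KEYWORD_PATTERNS = [
    r"corona",
    r"covid",
    r"distancing",
    r"face\s*cover",
    r"face\s*mask",
    r"#flattenthecurve",
    r"lockdown",
    r"#maskup",
    r"ncov",
    r"pandemic",
    r"paycheck\s+protection",
    r"quarantin",               # quarantine, quarantined, quarantining
    r"sars[-\s]?cov[-\s]?2",
    r"#stopthespread",
    r"\bwarp\b",                # whole word only, e.g. "Operation Warp Speed"
    r"wuhan",
]

# One combined, case-insensitive pattern; a Tweet is kept on any match.
PANDEMIC_RE = re.compile("|".join(KEYWORD_PATTERNS), re.IGNORECASE)


def mentions_pandemic(text: str) -> bool:
    """Return True if the text matches at least one keyword pattern."""
    return PANDEMIC_RE.search(text) is not None


tweets = [
    "Get tested for COVID-19 today",
    "Flu shots are available this week",
]
kept = [t for t in tweets if mentions_pandemic(t)]
```

<p>Broad patterns such as "corona" favor recall over precision, so a filter like this can also keep off-topic Tweets that happen to contain a keyword.</p>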
<h3 id="what-sampling-method-was-used-to-create-this-data-set-">What sampling method was used to create this data set?</h3>
<p>From our corpus, we created a stratified random sample by month and state to produce the data subset for labeling.</p>
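<p>The general idea of stratifying by month and state can be sketched in Python as follows. This is not the procedure used for the dataset. The source does not state how many Tweets were drawn per stratum or whether allocation was proportional, so the fixed per_stratum size, the field names "state" and "month", and the seed are all assumptions.</p>

```python
import random
from collections import defaultdict


def stratified_sample(tweets, per_stratum, seed=0):
    """Draw up to per_stratum Tweets at random from each (state, month) group.

    Assumes each Tweet is a dict with hypothetical "state" and "month" keys.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group Tweets into strata keyed by (state, month).
    strata = defaultdict(list)
    for tweet in tweets:
        strata[(tweet["state"], tweet["month"])].append(tweet)

    sample = []
    for key in sorted(strata):  # sorted so results do not depend on input order of strata
        group = strata[key]
        k = min(per_stratum, len(group))  # small strata contribute all they have
        sample.extend(rng.sample(group, k))
    return sample


corpus = [
    {"id": i, "state": state, "month": month}
    for state in ("MA", "TX")
    for month in (3, 4)
    for i in range(5)
]
subset = stratified_sample(corpus, per_stratum=2, seed=1)
```

<p>Sampling within each stratum keeps every state and month represented in the labeled subset, even when a few accounts dominate the corpus.</p>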
<h3 id="what-additional-processing-did-the-data-undergo-">What additional processing did the data undergo?</h3>
<p>We finalized our corpus on May 1, 2023, using academictwitteR to retrieve the full text of Tweets referenced by Retweets and Quote Tweets, as identified by metadata in the API download. We modified these Tweets to simulate their appearance in users’ timelines rather than potentially abridged versions in the original API download. Modified Retweets contained the full text of referenced Tweets. Modified Quote Tweets contained the full text of referenced Tweets, appended to commentary from state health agency accounts.</p>
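<p>The modification rules above can be sketched in Python as follows. This is not the authors' R code. The function name timeline_text, the "ref_type" field and its values, and the newline separator between commentary and quoted text are assumptions made for illustration.</p>

```python
def timeline_text(tweet, referenced_text=None):
    """Rebuild the text of a Tweet as described in the rules above.

    tweet is a dict with a "text" key and an optional, hypothetical
    "ref_type" key ("retweeted", "quoted", or absent for original Tweets).
    referenced_text is the separately retrieved full text of the
    referenced Tweet, if any.
    """
    ref_type = tweet.get("ref_type")

    if ref_type == "retweeted" and referenced_text:
        # Retweet: use the full text of the referenced Tweet.
        return referenced_text

    if ref_type == "quoted" and referenced_text:
        # Quote Tweet: agency commentary first, then the referenced text.
        # The newline separator is an assumption.
        return tweet["text"] + "\n" + referenced_text

    # Original Tweet, or referenced text unavailable: leave unchanged.
    return tweet["text"]


quote = {"text": "Good advice from our partners.", "ref_type": "quoted"}
full = timeline_text(quote, "Wash your hands often.")
```

<p>Falling back to the unmodified text when the referenced Tweet cannot be retrieved is a design choice of this sketch; the source does not say how such cases were handled.</p>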
<h3 id="what-do-the-labels-represent-">What do the labels represent?</h3>
<p>The labels represent judgements about whether the text meets guidelines from the CDC Clear Communication Index. In the training data, the labels represent one rater&#39;s judgement. In the validation data, the labels represent a consensus judgement between the two raters (with conflicts resolved by a third rater).</p>
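<p>The consensus rule for the validation data can be expressed as a short Python sketch. The function name consensus_label and the convention of passing the third rater's judgement as an optional argument are assumptions, not part of the original workflow.</p>

```python
def consensus_label(rater_a, rater_b, third_rater=None):
    """Return the agreed label, or the third rater's label on a conflict."""
    if rater_a == rater_b:
        return rater_a
    if third_rater is None:
        # A conflict cannot be resolved without the third judgement.
        raise ValueError("raters disagree and no third-rater label was given")
    return third_rater


agreed = consensus_label(1, 1)
resolved = consensus_label(1, 0, third_rater=0)
```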
<h2 id="where">Where</h2>
<p>This section provides information about the geographical locations and institutions involved in the data collection, storage, and distribution processes.</p>
<h3 id="where-was-data-collected-">Where was data collected?</h3>
<p>The data were collected in the US.</p>
<h2 id="when">When</h2>
<p>This section provides timelines associated with the data collection and distribution process.</p>
<h3 id="when-was-data-collected">When was data collected?</h3>
<p>The Tweet data were collected in April 2023.</p>
<h3 id="when-was-data-labeled">When was data labeled?</h3>
<p>The data were labeled from June through August 2023.</p>
<h2 id="why">Why</h2>
<p>This section provides insights into the purpose of data collection and sharing.</p>
<h3 id="what-was-the-primary-data-production-purpose-">What was the primary data production purpose?</h3>
<p>The data come from US state public health agencies doing outreach on Twitter during the first 9 months of the COVID-19 pandemic. We did not contact agencies for information about the Tweets. Instead, we assumed that all Tweets in this dataset were part of health agencies&#39; efforts to mitigate the population-level harm of the COVID-19 pandemic.</p>
<h3 id="what-was-the-primary-data-collection-purpose-">What was the primary data collection purpose?</h3>
<p>The data were collected to support Samuel R. Mendez&#39;s dissertation research in the Department of Social and Behavioral Sciences at the Harvard T.H. Chan School of Public Health, finished in May 2025. The data were part of a larger download, collected to enable multiple research studies related to health literacy, health communication, and social media.</p>
<p>The full text of referenced Tweets, as noted above, was collected specifically for a project examining the suitability of different machine learning methods to scale up the CDC Clear Communication Index.</p>
<h3 id="is-there-any-economic-political-or-personal-advantage-to-data-skewing-in-certain-direction-">Is there any economic, political, or personal advantage to data skewing in a certain direction?</h3>
<p>There would be an advantage to public health agencies if we found, through data labeling, that their social media posts were overwhelmingly following CDC clear communication guidelines. There would be an advantage to the authors if the data were more balanced, in terms of efficiently training machine learning models.</p>
<h3 id="is-this-data-collected-with-informed-consent-">Is this data collected with informed consent?</h3>
<p>No. As our corpus consists of publicly viewable texts made by publicly funded agencies in the course of public service, informed consent was not necessary. We also did not deem it necessary to inform state public health agencies of our data collection or analysis.</p>
<h2 id="how">How</h2>
<p>This section provides insights into the methods, tools, and frameworks used for data collection and distribution.</p>
<h3 id="what-tools-were-used-to-collect-the-data-">What tools were used to collect the data?</h3>
<p>We used R and RStudio to collect the data, via the academictwitteR package.</p>
<h3 id="what-tools-were-used-to-label-the-data-">What tools were used to label the data?</h3>
<p>The data labeling took place in a shared Google Drive, where Tweets processed for labeling were uploaded as CSV files, in folders unique to each labeler. Labelers were free to use whatever tools they wanted to label the Tweets, as long as their outputs were available as a CSV file in the same Drive folder.</p>
<h3 id="what-level-of-ownership-are-providers-being-given-">What level of ownership are providers being given?</h3>
<p>The public-record text from US state health agencies and the labels we created for our research are made available under a Public Domain license. The original organizational authors have the same access to them as anyone else able to access Harvard Dataverse.</p>
</body></html>